Remove premature ASSERT on gray failure status check in ClogRemoteTLog#13301
Remove premature ASSERT on gray failure status check in ClogRemoteTLog#13301saintstack wants to merge 1 commit into
Conversation
The gray failure status check asserted that the clogged remote TLog must appear in the status JSON immediately upon being excluded from dbInfo. However, gray failure detection has inherent latency: worker health monitors must detect the degraded peer, report to the cluster controller, and the CC must populate the status response. With certain seeds the test loop detects the TLog exclusion from dbInfo before the gray failure pipeline has propagated to the status JSON, causing a spurious assertion failure. Remove the ASSERT and let the test loop retry the status check on subsequent iterations. The core test validation (state path matching) is unaffected since the state transitions to CLOGGED_REMOTE_TLOG_EXCLUDED regardless of the status check result. Fixes: ClogRemoteTLog.toml simulation failure with seed 428562693 and gcc
There was a problem hiding this comment.
Pull request overview
This PR adjusts the ClogRemoteTLog simulation workload to avoid a flaky failure caused by asserting on gray-failure status JSON too early after a remote TLog is excluded from dbInfo, acknowledging the inherent propagation latency in the gray-failure reporting pipeline.
Changes:
- Removed an immediate
ASSERT(statusCheckPassed)after callinggrayFailureStatusCheck(). - Allows the workload loop to continue and retry the gray-failure status check on later iterations.
| localState = TestState::CLOGGED_REMOTE_TLOG_EXCLUDED; | ||
| if (!statusCheckPassed) { | ||
| statusCheckPassed = co_await grayFailureStatusCheck(db, self->cloggedRemoteTLog.get()); | ||
| ASSERT(statusCheckPassed); | ||
| } |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
There was a problem hiding this comment.
The gray failure status check asserted that the clogged remote TLog must appear in the status JSON immediately upon being excluded from dbInfo. However, gray failure detection has inherent latency: worker health monitors must detect the degraded peer, report to the cluster controller, and the CC must populate the status response.
I need to recollect my memory around this (it's been a while). But my understanding is that there's a causal relationship here that this statement is missing.
The causal order of events is:
- CC excludes a worker W based on complaints
- CC updates its in-memory state (this is what status actor uses)
- Recovery is triggered (because W was in txn subsystem)
- We don't recruit W during recovery
- Post recovery, the new serverdbinfo does not have W
So "being excluded from dbinfo" i.e. (5) implies that (2) has happened, which implies that status json should have gray_failure field and W in it.
Furthermore, the test assert is under this condition:
remoteTLogNotInDbInfo(self->cloggedRemoteTLog.get(), self->dbInfo->get()))
Continuing the events from above:
6. CC broadcasts new serverdbinfo, that eventually reaches the tester client (running the workload)
So the if remote tlog (W) was not in the new serverdbinfo, that means CC should have it in excluded set, and therefore status json should return it.
Perhaps there is another edge case because of which W is not in status json but I don't understand the serverdbinfo reasoning. For example, if sequencer reports W as unhealthy before gray failure in CC could make the decision, recovery would still be triggered, just not via gray failure. In that case, the test condition would be true (W is not in new serverdbinfo), but since this was not gray failure triggered recovery, status json won't show W. Maybe something along those lines is happening.
|
Thanks for helpful review @spraza ... digesting.... |
The gray failure status check asserted that the clogged remote TLog must appear in the status JSON immediately upon being excluded from dbInfo. However, gray failure detection has inherent latency: worker health monitors must detect the degraded peer, report to the cluster controller, and the CC must populate the status response. With certain seeds the test loop detects the TLog exclusion from dbInfo before the gray failure pipeline has propagated to the status JSON, causing a spurious assertion failure.
Remove the ASSERT and let the test loop retry the status check on subsequent iterations. The core test validation (state path matching) is unaffected since the state transitions to CLOGGED_REMOTE_TLOG_EXCLUDED regardless of the status check result.
Fixes: ClogRemoteTLog.toml simulation failure with seed 428562693 and gcc